1. Introduction

This project is an exploratory data analysis of White Wine Quality to determine which features affect wine quality. The dataset was created in 2009 by Paulo Cortez, F. Almeida, T. Matos and J. Reis and contains wine preferences accompanied by their physicochemical properties. The Quality rating output is based on sensory data, where at least 3 evaluations were made by wine experts per wine sample.

The white wine dataset contains the following:

Unique identifier:
   1 - X
   
Input variables (based on physicochemical tests):
   2 - fixed acidity
   3 - volatile acidity
   4 - citric acid
   5 - residual sugar
   6 - chlorides
   7 - free sulfur dioxide
   8 - total sulfur dioxide
   9 - density
   10 - pH
   11 - sulphates
   12 - alcohol

Output variable (based on sensory data): 
   13 - quality (score between 0 and 10)
   
I will use these values to determine which features have the greatest effect on a wine's quality rating.
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 4898
## [1] 13
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
A quick summary of the data shows that white wine is acidic, with a maximum pH of 3.82 and a mean alcohol content of about 10.5%. Quality ratings usually fall between 5 and 6 on a scale of 0 to 10 (the first quartile is 5; the median and third quartile are both 6).
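The printed output above has the shape that base R's names(), nrow(), ncol(), and summary() produce for a data frame. The report does not show the calls it used, so the following is only a minimal sketch on a toy data frame. The object name `wine` and all values in it are assumptions for illustration.

```r
# Toy stand-in for the white wine CSV. The values are illustrative only;
# the real file has 4,898 rows and 13 columns.
wine <- data.frame(
  X       = 1:5,
  pH      = c(3.00, 3.30, 3.26, 3.19, 3.10),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.2),
  quality = c(6L, 5L, 6L, 7L, 6L)
)

names(wine)    # column names
nrow(wine)     # number of observations
ncol(wine)     # number of features
summary(wine)  # min, quartiles, median, mean, and max for each column
```

On the real data, the same four calls would be run on the data frame returned by read.csv().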

2. Exploratory Analysis

2.1 Univariate Plots Section

This bar plot of Wine Quality shows that most ratings fall between 5 and 7, which is consistent with the dataset's description that there are many more normal wines than excellent or poor ones.

## [1] "The alcohol level feature in the White Wine dataset appears to be skewed right (positively skewed)."

## [1] "The distribution is skewed right."

## [1] "pH shows a bell like shape, so it appears to be normally distributed."

## [1] "Fixed Acidity also appears to be normally distributed."

## [1] "Volatile acidity in large quantities can lead to an unpleasant vinegar taste in wine. I wonder whether wine quality will correlate with it."

## [1] "Distribution appears symmetric with a couple of outliers."

## [1] "The Chlorides distribution appears to be symmetric, non-normal, & short-tailed."

## [1] "Free Sulfur Dioxide has a normal distribution."

## [1] "Free Sulfur Dioxide has a normal distribution."

## [1] "Density has a symmetric, non normal, & short tailed distribution."

## [1] "Sulphates can contribute to levels of sulfur dioxide in wine, so I expect their distributions to be related. However, the Sulphates distribution more closely resembles the Chlorides distribution. The Sulphates show a normal distribution."
The above plots show a smoothed version of a histogram for each input variable. Most of the plots had to be adjusted to display the data more clearly, which was done by setting scale limits on the x and y axes. The density estimates make the distributions easier to read.
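The report does not show its plotting code, so the following is a minimal sketch of a density estimate with an axis limit, written in base R as an assumption. The data are simulated, not taken from the wine dataset, and the limit of 25 is an arbitrary choice for this example.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated right-skewed values standing in for a column such as
# residual.sugar. These are not the real measurements.
x <- rexp(500, rate = 1 / 6)

# Kernel density estimate: a smoothed alternative to a histogram.
d <- density(x)

# Limit the x axis so a long right tail does not squash the bulk of the
# distribution. ylim can be set in the same way.
plot(d, xlim = c(0, 25), main = "Simulated right-skewed variable",
     xlab = "value")
```

Setting the limits only changes what is drawn. The estimate in `d` is still computed from all of the values.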

Univariate Analysis

What is the structure of your dataset?

The CSV dataset contains 4,898 observations of 13 features: X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, & quality. Two of the variables are integers: X, a unique identifier, and quality, the output rating. The remaining 11 variables are numeric inputs describing physical and chemical properties.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest. Of 11 input variables, I hope to determine which features influence the quality rating.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I predict that fixed.acidity, residual.sugar, & alcohol all contribute to the quality rating.

Did you create any new variables from existing variables in the dataset?

I did not create a new variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

No, I did not have to perform any operations on the data. The normal distributions shown worked for what I wanted to view.


2.2 Bivariate Plots Section

##       F.A   V.A   C.A  Sug  Chl   F.S  T.S  Den    pH   Sul   Alc
## F.A  1.00 -0.02  0.29 0.09 0.02 -0.05 0.09 0.27 -0.43 -0.02 -0.12
## V.A -0.02  1.00 -0.15 0.06 0.07 -0.10 0.09 0.03 -0.03 -0.04  0.07
## C.A  0.29 -0.15  1.00 0.09 0.11  0.09 0.12 0.15 -0.16  0.06 -0.08
## Sug  0.09  0.06  0.09 1.00 0.09  0.30 0.40 0.84 -0.19 -0.03 -0.45
## Chl  0.02  0.07  0.11 0.09 1.00  0.10 0.20 0.26 -0.09  0.02 -0.36
## F.S -0.05 -0.10  0.09 0.30 0.10  1.00 0.62 0.29  0.00  0.06 -0.25
The table above displays the first 6 rows of the newly created matrix of correlations between the input features (not including X or quality). I will use this matrix to draw a correlation plot to determine which features are closely related.

This correlation plot shows the correlations between all of the input variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity and circle size are proportional to the correlation coefficients.
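A correlation matrix like the one printed above can be built with base R's cor(). The following is a minimal sketch on a toy data frame. The values are made up, only a few of the report's columns are included, and the commented-out plotting call assumes the corrplot package, which the report does not name.

```r
# Toy data frame using a few of the report's column names (made-up values).
wine <- data.frame(
  X              = 1:6,
  residual.sugar = c(1.2, 2.5, 6.0, 9.8, 14.1, 18.0),
  density        = c(0.9905, 0.9918, 0.9940, 0.9962, 0.9985, 1.0002),
  alcohol        = c(12.8, 12.1, 10.9, 10.2, 9.4, 9.0),
  quality        = c(7, 6, 6, 5, 5, 6)
)

# Drop the identifier and the output variable, keeping only the inputs.
inputs <- wine[, !(names(wine) %in% c("X", "quality"))]

# Pearson correlation matrix, rounded to two decimals like the printed table.
corr <- round(cor(inputs), 2)
head(corr)

# Assumption: a circle-style plot could be drawn with the corrplot package.
# corrplot::corrplot(corr, method = "circle")
```

The matrix is symmetric with ones on the diagonal, so each pair of features only needs to be read once.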

## [1] "Density & Residual Sugar show the highest correlation coefficient of 0.84. This graph displays the positive correlation, showing that as residual sugar content increases, the density also increases."

## [1] "Alcohol & Residual Sugar show a correlation coefficient of -0.45. This plot shows that as Alcohol percentage increases the amount of sugar decreases."

## [1] "The alcohol & density relationship shows a negative correlation, which means as alcohol content increases, the density of the wine decreases."

## [1] "Free.sulfur.dioxide & total.sulfur.dioxide show a positive correlation, so as one value increases the other also increases."

## [1] "Fixed.acidity & pH show a negative correlation, so as pH increases fixed.acidity decreases."
These scatter plots display the strongest correlations found in the correlation plot. On their own they say little about the relationships I want to investigate, because none of them includes quality. However, I can reuse them in the Multivariate section by adding quality as color.

Since I predict Alcohol to be one of the greatest determining factors of a wine's quality rating, I decided to plot Alcohol Content vs. Quality Rating in a box plot. This plot shows that the wines with a higher than average alcohol content are also the wines with the highest quality ratings.
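The comparison behind this box plot can also be checked numerically by taking the median alcohol content within each quality rating. The following is a minimal sketch in base R on made-up values. The data frame `wine` here is a toy stand-in, not the real dataset.

```r
# Toy data: three wines at each of three quality ratings (made-up values).
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.0, 10.1, 10.4, 11.0, 11.2, 11.8, 12.5)
)

# Median alcohol content for each quality rating.
med <- tapply(wine$alcohol, wine$quality, median)
med

# One box per quality rating, showing the spread of alcohol within it.
boxplot(alcohol ~ quality, data = wine,
        xlab = "Quality rating", ylab = "Alcohol (% by volume)")
```

Medians that rise with the rating, as they do in this toy example, are the pattern the box plot is meant to reveal.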

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I observed correlations between alcohol & residual.sugar and alcohol & density, but the greatest correlation was between sugar & density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I noticed that there were correlations in alcohol & density, sugar & density and free.sulfur.dioxide & total.sulfur.dioxide. My predictions did not focus on density or sulfur.dioxide as a factor.

The alcohol & density relationship shows a negative correlation, which means as alcohol content increases, the density of the wine decreases. Density also has a positive relationship with total.sulfur.dioxide, which means as total.sulfur.dioxide increases the density of the wine increases.

What was the strongest relationship you found?

The strongest relationship appeared to be between sugar & density, with a correlation coefficient of 0.84, followed by alcohol & density at -0.78.

2.3 Multivariate Plots Section

In this section, I chose to redraw the plots from the Bivariate section with an overlay of a purple color palette that represents the Quality rating by color intensity.

The plots of correlated features in this section now have an added layer of color showing the quality rating for each observation. It is hard to judge any trends across the wide range of quality rating colors, so I created a new variable called 'qrating' to display the quality ratings in three categories.

I decided to map quality ratings into three groups: Poor, Average, & Great. The 'Poor' group consists of ratings 3, 4, and 5, which make up 1640/4898 or 33% of the ratings. The 'Average' group contains the 6 rating, for 2198/4898 or 45%. The 'Great' group contains the higher tier of wines rated 7, 8, or 9, which accounts for 1060/4898 or 22% of the ratings. The dataset states that "the classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones)." So 'Average' is synonymous with 'Normal' in this analysis, and 'Great' with 'Excellent'.
##    Poor Average   Great 
##    1640    2198    1060
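The grouping described above can be sketched with base R's cut(). This is an assumption about how 'qrating' might be built, not the report's actual code. The quality scores below are a toy vector, while the counts are the ones printed above.

```r
# Toy quality scores standing in for the real column (illustrative only).
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Bin the 3-9 scores into three tiers. With the default right-closed
# intervals: (2,5] -> Poor, (5,6] -> Average, (6,9] -> Great.
qrating <- cut(quality,
               breaks = c(2, 5, 6, 9),
               labels = c("Poor", "Average", "Great"))
table(qrating)

# Percentage shares recomputed from the counts printed in the report.
counts <- c(Poor = 1640, Average = 2198, Great = 1060)
round(100 * counts / sum(counts))
```

The rounded shares come out as 33, 45, and 22, matching the percentages quoted in the text.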

## [1] "Density appears to show a correlation to quality, even though residual.sugar does not."

## [1] "Alcohol appears to show a correlation to quality, even though residual.sugar does not."

## [1] "Alcohol & Density both show a correlation to quality, so as alcohol content increases wine quality increases and density decreases."

## [1] "Free.sulfur.dioxide & total.sulfur.dioxide show no clear correlation to wine quality."

This plot displays Fixed Acidity vs. Residual Sugar vs. pH. I wanted to determine whether there were correlations between the remaining features I focused on from the beginning of the project. There is only a relationship between fixed.acidity and pH, where acidity increases as the pH value decreases.
The following shows the number of observations in each Density category.
##     Low Average    High 
##    1185    2442    1271

This scatter plot shows another representation of the correlation between Alcohol content, quality rating, and density. I also made use of a green color palette, to differentiate the 'dvalue' plot from previous 'qrating' plots.

The density value is categorized as Low (Values < 0.9917), Average (Values between 0.9917 & 0.9961), & High (values > 0.9961).

This representation is more useful: as the alcohol percentage increases, the density falls, and vice versa. Both a lower density & a higher alcohol content are associated with a higher quality rating. This plot appears to be the most effective at displaying the strongest correlation, as sugar was found to have no bearing on quality.
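The three density categories can be sketched as follows. The report does not show how 'dvalue' was built, so this is an assumption. In particular, values exactly equal to 0.9917 or 0.9961 are placed in 'Average' here, which the text does not specify, and the density values are a toy vector.

```r
# Toy density values, including both thresholds (illustrative only).
density_vals <- c(0.9900, 0.9917, 0.9940, 0.9961, 1.0010)

# Low: below 0.9917. High: above 0.9961. Everything else, including the
# two thresholds themselves, is Average (an assumption about boundaries).
dvalue <- factor(
  ifelse(density_vals < 0.9917, "Low",
         ifelse(density_vals > 0.9961, "High", "Average")),
  levels = c("Low", "Average", "High")
)
table(dvalue)
```

The thresholds match the first and third quartiles of density in the summary output, so 'Average' covers roughly the middle half of the wines.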

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It appears that Density affects Quality, but Residual Sugar does not. I also observed that Alcohol and Density both display a strong effect on Quality. Alcohol individually shows the strongest relationship with Quality.

Were there any interesting or surprising interactions between features?

I was surprised that alcohol content also directly affected the density of the wine. I was also surprised that there were very few strong relationships between the inputs & the quality rating output. The majority of the plots show a lack of consistency, which rules out any strong correlation.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No, I did not.

3. Final Plots and Summary

3.1 Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Description One

The Quality rating, the output variable of the wine data, displayed as a histogram. This bar plot of Wine Quality shows that most ratings fall between 5 and 7, which is consistent with the dataset’s description that there are many more normal wines than excellent or poor ones.

3.2 Plot Two

Description Two

Box plot displaying the range of alcohol (%/vol) for each Quality rating as a number. This plot confirms my analysis that the wines with a higher than average alcohol content are also the wines with the highest quality ratings. Had I created qrating earlier in the project, I could have displayed the quality rating data on a simpler plot by using the three tier rating scale that was utilized for testing. ‘qrating’ allows me to display this finding more clearly.

3.3 Plot Three

Description Three

This scatter plot shows another representation of the correlation between Alcohol content, quality rating, and density. I also made use of a green color palette, to differentiate the ‘dvalue’ plot from previous ‘qrating’ plots.

The density value is categorized as Low (Values < 0.9917), Average (between 0.9917 & 0.9961), & High (values > 0.9961).

This representation is more useful: as the alcohol percentage increases, the density falls, and vice versa. Both a lower density & a higher alcohol content are associated with a higher quality rating. With correlation coefficients of -0.78 for Alcohol/Density, -0.31 for Quality/Density, & 0.43 for Quality/Alcohol, this plot appears to be the most effective at displaying the strongest correlations, as sugar was found to have no bearing on quality.

4. Reflection

The White Wine dataset contains 4898 observations with 13 features. Of the 13 features there are 11 input features, one output feature, and one unique identifier. The purpose of this Exploratory Data Analysis was to determine which features impact the quality of the wine.

I initially used histograms to display the data, but I was not able to determine any correlations. Scatter plots provided more intuitive visualizations that could be easily decoded. The use of Bivariate Plots allowed me to obtain correlations. These plots helped me determine that residual.sugar & density shared the strongest correlation. I then used their strong correlation to determine if they affected quality jointly or individually. I discovered that density affected quality, but residual.sugar did not. The second strongest correlation was between alcohol & density, and they jointly affected White Wine quality. The third strongest correlation was between alcohol & residual.sugar; however, my plot again confirmed that residual.sugar did not appear to affect quality. The fourth strongest correlation was between free.sulfur.dioxide & total.sulfur.dioxide, but total.sulfur.dioxide had more impact on quality than free.sulfur.dioxide.

In conclusion, my prediction that the three features fixed.acidity, residual.sugar, & alcohol affect wine quality proved to be largely incorrect. Of those features, only alcohol appeared to affect wine quality. In addition to alcohol, density also appeared to affect wine quality. Based on these findings, I would use alcohol percentage and density to predict White Wine quality in any future investigations.